Mini-Project: SVM & LR Classification

Rain in Australia: Predict Rain Tomorrow

Created by An Nguyen, Andy Ho, Jodi Pafford, Tori Wheelis

Team Lead: Andy Ho
February 14, 2019

Create Models

The following code covers pre-processing, helper function definitions, logistic regression, and support vector machines.

In [1]:
import pandas as pd
import numpy as np

##Load original dataset.  No longer needed.
##The cleaning process is very computationally expensive, so the 'rainfall.csv' file was created for later use. 
#rainfall_original = pd.read_csv('weatherAus.csv') 

##load pre-generated rainfall.csv file.
rainfall = pd.read_csv('rainfall.csv', index_col=0) 
rainfall.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 140787 entries, 0 to 140786
Data columns (total 23 columns):
Date             140787 non-null object
Location         140787 non-null object
MinTemp          140787 non-null float64
MaxTemp          140787 non-null float64
Rainfall         140787 non-null float64
Evaporation      97184 non-null float64
Sunshine         89329 non-null float64
WindGustDir      134862 non-null float64
WindGustSpeed    134862 non-null float64
WindDir9am       140787 non-null float64
WindDir3pm       140787 non-null float64
WindSpeed9am     140787 non-null float64
WindSpeed3pm     140787 non-null float64
Humidity9am      140787 non-null float64
Humidity3pm      140787 non-null float64
Pressure9am      129190 non-null float64
Pressure3pm      129190 non-null float64
Cloud9am         107253 non-null float64
Cloud3pm         107253 non-null float64
Temp9am          140787 non-null float64
Temp3pm          140787 non-null float64
RainToday        140787 non-null int64
RainTomorrow     140787 non-null int64
dtypes: float64(19), int64(2), object(2)
memory usage: 25.8+ MB
In [2]:
#Functions to find the average of the values bracketing each NaN.
    #For instance, if a city's 'MinTemp' has 34, 32, NaN, NaN, 55 recorded,
    #the function will average 32 and 55 for the first NaN: (32+55)/2 = 43.5,
    #and average that result and 55 for the second NaN: (43.5+55)/2 = 49.25.
#Only values from the same city will be used.
#If the NaN is the earliest timepoint for a given city, the next non-NaN timepoint is used instead of the mean.
#If the NaN is the latest timepoint for a given city, the previous non-NaN timepoint is used instead of the mean.

def impute_by_city(cities,variables):
    for c in cities:
        #Parse out observations from a single city.
        temp = rainfall[rainfall.Location == c]
        
        #Iterate through all observations of the temp data frame.
        i = min(temp.index)
        while i <= max(temp.index):
            for v in variables:
                #Check to see if there are values recorded for the variable, will pass over if all are NaN.
                if pd.isna(temp[v]).all():
                    pass
                
                #Check to see if a single value is NaN.
                #Check to see if a single value is NaN.
                elif pd.isna(temp[v][i]):
                    #Find the mean of bracketing values and impute into the main dataframe.
                    #Use .loc indexing to avoid pandas' SettingWithCopyWarning from chained assignment.
                    temp.loc[i, v] = find_mean(temp[v], i)
                    rainfall.loc[i, v] = temp.loc[i, v]
            i = i + 1       

#Find mean of bracketing values.
def find_mean(templist, index):
    #If NaN is earliest timepoint for the city take the next value that is not NaN.
    if index == min(templist.index): 
        return find_top(templist, index)
    
    #If latest timepoint for the city take the previous value that is not NaN.
    elif index == max(templist.index): 
        return find_bottom(templist, index)
    
    else:
        #Find previous non-NaN value.
        bottom = find_bottom(templist, index) 
        #Find next non-NaN value.
        top = find_top(templist, index) 
        
    #If the current value is not from the latest timepoint for the city but there are no more non-NaN values
    #recorded after it, the previous non-NaN value is used.
    if pd.isna(top): 
        return bottom
    

    else:
        mean = (top + bottom)/2
        return mean

#Find previous non-NaN value.
def find_bottom(templist, index):
    while pd.isna(templist[index-1]):
        index = index-1
    bottom = templist[index-1]
    return bottom

#Find next non-NaN value.
#If there are no more non-NaN values return the previous non-NaN value.
def find_top(templist, index):
    while pd.isna(templist[index+1]):
        index = index+1
        if index == max(templist.index):
            top = np.nan
            return top
    top = templist[index+1]
    return top   
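For reference, pandas can approximate the same per-city gap filling much faster with `interpolate` plus edge fills. This is a sketch of an alternative, not the method used to build 'rainfall.csv'; note that linear interpolation spaces consecutive NaN's evenly, rather than averaging iteratively as the functions above do.

```python
import pandas as pd

def impute_by_city_fast(df, cities_col, variables):
    #Linearly interpolate within each city, then fill leading/trailing NaN's
    #with the nearest recorded value, mirroring the edge rules above.
    out = df.copy()
    for v in variables:
        out[v] = (out.groupby(cities_col)[v]
                     .transform(lambda s: s.interpolate().bfill().ffill()))
    return out

demo = pd.DataFrame({'Location': ['A']*3 + ['B']*2,
                     'MinTemp': [32.0, None, 55.0, None, 10.0]})
filled = impute_by_city_fast(demo, 'Location', ['MinTemp'])
```

The single-NaN case matches the bracketing average exactly ((32+55)/2 = 43.5); runs of several NaN's will differ slightly between the two approaches.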
In [3]:
##Code for first run data cleaning, no longer needed after 'rainfall.csv' was created at the end of cleaning process.

#rainfall = rainfall_original.copy()

##'RISK_MM' was used by the creator of the dataset to derive the response variable, 'RainTomorrow.'  It must be
##dropped so it does not leak into the prediction.
#rainfall.drop(["RISK_MM"], axis=1, inplace=True)

##Drop any observation with no record of rainfall for the day.  Cannot be imputed.
#rainfall.dropna(subset=["RainToday"], inplace=True)

#Reset the index of each observation to match its iloc, removing gaps between index integers.
#rainfall = rainfall.reset_index(drop=True)
#rainfall.info()
In [4]:
##can be skipped if rainfall.csv already generated!

##set the cardinal directions to degrees.
#directions = {'N':0, 'NNE':22.5, 'NE':45, 'ENE':67.5, 'E':90, 'ESE':112.5, 'SE':135, 'SSE':157.5, 'S':180,\
#              'SSW':202.5, 'SW':225, 'WSW':247.5, 'W':270, 'WNW':292.5, 'NW':315, 'NNW':337.5}

##Replace cardinal directions with their corresponding degrees.
#rainfall = rainfall.replace(directions) 

#Get name of all cities in the data frame.
#cities = rainfall.Location.unique() 

#c_variables = []
#d_variables = []

##Change 'Yes' and 'No' to 1 and 0 respectively.
##(np.int is deprecated in recent NumPy; the built-in int is used instead.)
#rainfall.RainToday = (rainfall.RainToday=='Yes').astype(int)
#rainfall.RainTomorrow = (rainfall.RainTomorrow=='Yes').astype(int)

##Find all variables with a continuous data type.
#for l in list(rainfall):
#    if (rainfall[l].dtypes == 'float64'):
#        c_variables.append(l)
#    else:
#        d_variables.append(l)
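As a quick check of the cardinal-to-degrees encoding, `map` (or `replace`, as above) converts the compass strings to numbers. A minimal sketch using a few entries of the dictionary above:

```python
import pandas as pd

directions = {'N': 0, 'NNE': 22.5, 'NE': 45, 'E': 90, 'S': 180, 'W': 270}
winds = pd.Series(['N', 'E', 'S', 'NE'])
degrees = winds.map(directions)   #unlike replace, map turns unmatched keys into NaN
```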
In [5]:
##Can be skipped if rainfall.csv was already generated! Very expensive; 'rainfall.csv' can instead be loaded from the working directory.

##Impute values to NaN's and save to csv file for later use.
#impute_by_city(cities, c_variables)
#rainfall.to_csv("rainfall.csv", sep=',', index=True)
In [6]:
#Variables 'Evaporation' and 'Sunshine' contained many missing values, too many to be imputed.
rainfall = rainfall.drop(['Evaporation', 'Sunshine'], axis = 1)

#Get name of all cities in the data frame.
l = list(rainfall.Location.unique())

#Drop all observations with NaN's.  These are values that could not be imputed using the above code.
rainfall.dropna(subset = list(rainfall), inplace = True)

#List all cities that were dropped
for i in l:
    if i not in rainfall.Location.unique():
        print(i)
        
#'Date' and 'Location' variables not needed for prediction. 
rainfall = rainfall.drop(['Date', 'Location'], axis = 1) 

##Make a copy of the main data frame to be used by Logistic Regression.
lr_rainfall = rainfall.copy()

##Make a copy of the main data frame to be used by Support Vector Machines.
svm_rainfall = rainfall.copy()
BadgerysCreek
Newcastle
NorahHead
Penrith
Tuggeranong
MountGinini
Nhil
Dartmoor
GoldCoast
Adelaide
Albany
Witchcliffe
SalmonGums
Walpole

Logistic Regression

In [7]:
from sklearn.model_selection import ShuffleSplit as ss

#Assign values to response variable, y, and explanatory variables, x.
if 'RainTomorrow' in lr_rainfall:
    #Response variable is 'RainTomorrow'
    y = lr_rainfall['RainTomorrow'].values
    
    #Remove response variable from dataframe
    del lr_rainfall['RainTomorrow']
    
    #Everything else is the explanatory variables used in prediction.
    x = lr_rainfall.values 
    
#Split our data into training and testing sets; 80% of the data will be in the training set and 20% in the testing set.
#The data will be split this way 5 times; the value can be changed per the user's judgement.  It is recommended that the
#number of iterations be at least 2 so that standard deviations can be computed.
num_cv_iterations = 5
num_instances = len(y)
cv_object = ss(n_splits=num_cv_iterations, test_size  = 0.2)
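The behavior of `ShuffleSplit` can be sketched on toy data: each of the `n_splits` iterations yields an independent random 80/20 partition (unlike `KFold`, the test sets may overlap across iterations).

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

toy_y = np.arange(100)
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

#Each split returns index arrays; here every split is 80 train / 20 test.
sizes = [(len(tr), len(te)) for tr, te in splitter.split(toy_y)]
```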
In [8]:
from sklearn import metrics as mt
from sklearn.preprocessing import StandardScaler as sts
from sklearn.linear_model import LogisticRegression as lr
import time

column_names = lr_rainfall.columns
weights = []
weights_array = []

scl_obj = sts()
t0=time.time()

#Split the data into training and testing set 5 different ways, iterate through each way.
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(x,y)):
    
    #Standardize the explanatory variables so each has a mean of 0 and a standard deviation of 1.
    #The training-set mean is subtracted from each value and the result is divided by the training-set standard deviation.
    scl_obj.fit(x[train_indices])
    X_train_scaled = scl_obj.transform(x[train_indices])
    X_test_scaled = scl_obj.transform(x[test_indices])

    #Perform Logistic Regression on training set.
    lr_clf = lr(penalty='l2', C=0.05)
    lr_clf.fit(X_train_scaled,y[train_indices])

    #Perform prediction using the scaled explanatory variables of the testing set.
    y_hat = lr_clf.predict(X_test_scaled)
    
    #Find accuracy and confusion matrix of the prediction above.
    acc = mt.accuracy_score(y[test_indices],y_hat)
    conf = mt.confusion_matrix(y[test_indices],y_hat)
    print("")
    print('accuracy:', acc )
    print(conf )
    print ("Time to Run:", time.time()-t0)
    
    #Get the name and weight of each variable as computed by the regression model.
    zip_vars = pd.Series(lr_clf.coef_[0].T, index=column_names)
    for name, coef in zip_vars.items():
        #Print out names and weight of each variable
        print(name, 'has weight of', coef)
        
        #Add weights computed by current iteration of training and testing split.
        weights.append(coef)
    
    #Add all the weights of each iteration into one master array.  
    weights_array.append(weights)
    
    #reset weights variable for next iteration.
    weights = []

#Convert weights_array into a numpy array.
weights_array = np.array(weights_array)
accuracy: 0.8519194710352315
[[15104   861]
 [ 2140  2161]]
Time to Run: 0.7858941555023193
MinTemp has weight of 0.04983218614522581
MaxTemp has weight of 0.10297354517267217
Rainfall has weight of 0.08378214162758452
WindGustDir has weight of 0.027575642347826227
WindGustSpeed has weight of 0.7152565343917393
WindDir9am has weight of -0.10468039026722778
WindDir3pm has weight of 0.08822222298037855
WindSpeed9am has weight of -0.06640452938618237
WindSpeed3pm has weight of -0.25462705181686496
Humidity9am has weight of 0.0840328635863862
Humidity3pm has weight of 1.1493285608909136
Pressure9am has weight of 1.0943986412815792
Pressure3pm has weight of -1.4980018881356916
Cloud9am has weight of 0.1581605881899202
Cloud3pm has weight of 0.3642741840200382
Temp9am has weight of 0.20636517390608963
Temp3pm has weight of -0.39085860640847
RainToday has weight of 0.19825193414056677

accuracy: 0.8520181584920556
[[15109   865]
 [ 2134  2158]]
Time to Run: 1.7356648445129395
MinTemp has weight of 0.06592232942265776
MaxTemp has weight of 0.11902683597044562
Rainfall has weight of 0.08069798044425627
WindGustDir has weight of 0.03318402296943354
WindGustSpeed has weight of 0.7208411385044707
WindDir9am has weight of -0.10894943054026729
WindDir3pm has weight of 0.08682974818394588
WindSpeed9am has weight of -0.06940681715496491
WindSpeed3pm has weight of -0.2611260525011706
Humidity9am has weight of 0.08050778970933573
Humidity3pm has weight of 1.1535158939165653
Pressure9am has weight of 1.0083223317738543
Pressure3pm has weight of -1.426170218628813
Cloud9am has weight of 0.15505864016014465
Cloud3pm has weight of 0.3706004190720147
Temp9am has weight of 0.15516288980064608
Temp3pm has weight of -0.36682843120883424
RainToday has weight of 0.1931160714942422

accuracy: 0.8568538438764433
[[15186   806]
 [ 2095  2179]]
Time to Run: 2.4800310134887695
MinTemp has weight of 0.08176397382550463
MaxTemp has weight of 0.09305528582555005
Rainfall has weight of 0.08508842379353623
WindGustDir has weight of 0.01921258011439139
WindGustSpeed has weight of 0.7175164766065243
WindDir9am has weight of -0.10615049005614088
WindDir3pm has weight of 0.0837977325925631
WindSpeed9am has weight of -0.0747358198670696
WindSpeed3pm has weight of -0.2610834866048604
Humidity9am has weight of 0.08687025984624999
Humidity3pm has weight of 1.1262642334750554
Pressure9am has weight of 1.0587410415496212
Pressure3pm has weight of -1.477563438136025
Cloud9am has weight of 0.14968346499452884
Cloud3pm has weight of 0.37233161956674343
Temp9am has weight of 0.1595803631308553
Temp3pm has weight of -0.36978534175925676
RainToday has weight of 0.18606033492721227

accuracy: 0.8534984703444192
[[15195   827]
 [ 2142  2102]]
Time to Run: 3.30551815032959
MinTemp has weight of 0.07523116264615888
MaxTemp has weight of 0.0634854643528297
Rainfall has weight of 0.08864519089490831
WindGustDir has weight of 0.03079248389428619
WindGustSpeed has weight of 0.7070305034799933
WindDir9am has weight of -0.10432019098563038
WindDir3pm has weight of 0.07339986171561551
WindSpeed9am has weight of -0.07692705555305165
WindSpeed3pm has weight of -0.26031181483852145
Humidity9am has weight of 0.07929315719351873
Humidity3pm has weight of 1.1311447781673034
Pressure9am has weight of 1.0367445081535953
Pressure3pm has weight of -1.46684589988709
Cloud9am has weight of 0.15356471707118446
Cloud3pm has weight of 0.37128125793812505
Temp9am has weight of 0.15608992518131995
Temp3pm has weight of -0.3424432997873112
RainToday has weight of 0.19375498943592367

accuracy: 0.8551761571104313
[[15184   825]
 [ 2110  2147]]
Time to Run: 4.098825931549072
MinTemp has weight of 0.05975176009832831
MaxTemp has weight of 0.05357442060719897
Rainfall has weight of 0.08630210493992105
WindGustDir has weight of 0.031754151228052564
WindGustSpeed has weight of 0.7051886214344036
WindDir9am has weight of -0.10566361750201346
WindDir3pm has weight of 0.08553790997646463
WindSpeed9am has weight of -0.062053985792011254
WindSpeed3pm has weight of -0.2559856585911852
Humidity9am has weight of 0.09236965515179811
Humidity3pm has weight of 1.1308878741917048
Pressure9am has weight of 1.0711434750047055
Pressure3pm has weight of -1.477310489688006
Cloud9am has weight of 0.1569745963881575
Cloud3pm has weight of 0.3719800893967147
Temp9am has weight of 0.2190811341937669
Temp3pm has weight of -0.35241914602929714
RainToday has weight of 0.19836379254543482
In [9]:
import plotly as ply

#Run at the start of plotly.
ply.offline.init_notebook_mode() 

#Get mean weights for each variable
mean_weights = np.mean(weights_array,axis = 0)

#Get standard deviation of weights for each variable
std_weights = np.std(weights_array,axis = 0)

#Make one dataframe with variable names as the index, and a column of means and a column of standard deviations for each variable.
final_array = pd.DataFrame(data={'mean':mean_weights, 'std':std_weights}, index = column_names)

#Sort the variables in ascending order using the 'mean' column.
final_array = final_array.sort_values(by=['mean'])

#Error bar.
error_y=dict(type='data', array=final_array['std'].values, visible=True)

#Graph mean with standard deviation of each variable.
graph1 = {'x': final_array.index, 'y': final_array['mean'].values, 'error_y':error_y,'type': 'bar'}

fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Logistic Regression Weights, with error bars'}

ply.offline.iplot(fig)

#Grab coefficients with absolute values greater than 0.5; the cutoff can be set by the user.
cutoff = 0.5
lr_voi = []
for index, columns in final_array.iterrows():
    if (columns['mean'] > cutoff) or (columns['mean'] < -cutoff):
        lr_voi.append(index)
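The cutoff loop above can also be written with boolean indexing on the absolute value, which is the more idiomatic pandas form. An equivalent sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'mean': [1.1, -0.7, 0.2, -0.1]},
                   index=['Humidity3pm', 'Pressure3pm', 'Cloud3pm', 'WindGustDir'])
cutoff = 0.5

#Keep only the variables whose mean weight exceeds the cutoff in magnitude.
voi = list(toy.index[toy['mean'].abs() > cutoff])
```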
In [10]:
from matplotlib import pyplot as plt

#Add the response variable back to the explanatory dataset.
lr_rainfall['RainTomorrow'] = y 

#Group observations by the response variable, 1's together and 0's together.
df_grouped = lr_rainfall.groupby(['RainTomorrow'])

#Plot kernel density estimates of variables with weights larger than the user-defined cutoff set above, 0.5 in this case.
vars_to_plot = lr_voi

for v in vars_to_plot:
    plt.figure(figsize=(10,4))
    plt.subplot(1,1,1)
    ax = df_grouped[v].plot.kde() 
    plt.legend(['no rain','rained'])
    plt.title(v+' (Original)')

Support Vector Machines

In [11]:
#Assign values to response variable, y, and explanatory variables, x.
if 'RainTomorrow' in svm_rainfall:
    #Response variable is 'RainTomorrow'
    y = svm_rainfall['RainTomorrow'].values
    
    #Remove response variable from dataframe
    del svm_rainfall['RainTomorrow']
    
    #Everything else is the explanatory variables used in prediction.
    x = svm_rainfall.values 
    
#Split our data into training and testing sets; 80% of the data will be in the training set and 20% in the testing set.
#The data will be split this way 5 times; the value can be changed per the user's judgement.  It is recommended that the
#number of iterations be at least 2 so that standard deviations can be computed.
num_cv_iterations = 5
num_instances = len(y)
cv_object = ss(n_splits=num_cv_iterations, test_size  = 0.2)
In [12]:
from sklearn.svm import SVC 

weights = []
weights_array = []
t0=time.time()

#Split the data into training and testing set 5 different ways, iterate through each way.
for train_indices, test_indices in cv_object.split(x,y): 
    
    #Standardize the explanatory variables so each has a mean of 0 and a standard deviation of 1.
    #The training-set mean is subtracted from each value and the result is divided by the training-set standard deviation.
    scl_obj.fit(x[train_indices])
    X_train_scaled = scl_obj.transform(x[train_indices])
    X_test_scaled = scl_obj.transform(x[test_indices])

    #Perform Support Vector Machine on training set.
    svm_clf = SVC(C=0.5, kernel='linear', degree=3, gamma='auto')
    svm_clf.fit(X_train_scaled, y[train_indices])

    #Perform prediction using the scaled explanatory variables of the testing set.
    y_hat = svm_clf.predict(X_test_scaled)
    
    #Find accuracy and confusion matrix of the prediction above.
    acc = mt.accuracy_score(y[test_indices],y_hat)
    conf = mt.confusion_matrix(y[test_indices],y_hat)
    print("")
    print('accuracy:', acc )
    print(conf)
    print ("Time to Run:", time.time()-t0)
    
    #Get the name and weight of each variable as computed by the SVM model.
    zip_vars = pd.Series(svm_clf.coef_[0],index=column_names)
    for name, coef in zip_vars.items():
        #Print out names and weight of each variable
        print(name, 'has weight of', coef)
        
        #Add weights computed by current iteration of training and testing split.
        weights.append(coef)
    
    #Add all the weights of each iteration into one master array.  
    weights_array.append(weights)
    
    #reset weights variable for next iteration.
    weights = []
    
#Convert weights_array into a numpy array.
weights_array = np.array(weights_array)  
accuracy: 0.8542879699990131
[[15375   659]
 [ 2294  1938]]
Time to Run: 281.8026978969574
MinTemp has weight of -0.012158372126805261
MaxTemp has weight of 0.3038183268447483
Rainfall has weight of 0.10587639380673863
WindGustDir has weight of 0.023177102593422205
WindGustSpeed has weight of 0.4730351444820826
WindDir9am has weight of -0.07728205155865453
WindDir3pm has weight of 0.07072174598579295
WindSpeed9am has weight of -0.027071034055893506
WindSpeed3pm has weight of -0.22178865815499194
Humidity9am has weight of -0.012453785275738483
Humidity3pm has weight of 0.8192558152286438
Pressure9am has weight of 0.7787257206582581
Pressure3pm has weight of -1.0308269904819554
Cloud9am has weight of 0.06977445189477294
Cloud3pm has weight of 0.17410514764605978
Temp9am has weight of 0.005879472928455698
Temp3pm has weight of -0.33310772788058784
RainToday has weight of 0.12847296673726305

accuracy: 0.8521168459488799
[[15258   741]
 [ 2256  2011]]
Time to Run: 572.7877209186554
MinTemp has weight of -0.026334770581797784
MaxTemp has weight of 0.3057615714451458
Rainfall has weight of 0.09680065443626518
WindGustDir has weight of 0.007041534532504556
WindGustSpeed has weight of 0.47072726379951746
WindDir9am has weight of -0.07287352904438649
WindDir3pm has weight of 0.08137767089124281
WindSpeed9am has weight of -0.033350610137404146
WindSpeed3pm has weight of -0.21482216435379087
Humidity9am has weight of -0.010602104149256775
Humidity3pm has weight of 0.8157879801628951
Pressure9am has weight of 0.7641769498741269
Pressure3pm has weight of -1.012111968403815
Cloud9am has weight of 0.07888387022876486
Cloud3pm has weight of 0.1796902570963539
Temp9am has weight of 0.037149214522088414
Temp3pm has weight of -0.354975426999772
RainToday has weight of 0.13529861711413105

accuracy: 0.8516234086647587
[[15258   638]
 [ 2369  2001]]
Time to Run: 853.3642799854279
MinTemp has weight of -0.017642455688474
MaxTemp has weight of 0.2770454979297483
Rainfall has weight of 0.10749202794295343
WindGustDir has weight of 0.012359114323771792
WindGustSpeed has weight of 0.46609710873531185
WindDir9am has weight of -0.06560991219851076
WindDir3pm has weight of 0.06972254946606427
WindSpeed9am has weight of -0.03445843407121174
WindSpeed3pm has weight of -0.2101594104792639
Humidity9am has weight of -0.013477012985731562
Humidity3pm has weight of 0.8106205553922337
Pressure9am has weight of 0.7776601130831295
Pressure3pm has weight of -1.0321759864882551
Cloud9am has weight of 0.07368242040365658
Cloud3pm has weight of 0.17131074901317334
Temp9am has weight of 0.03324840355671199
Temp3pm has weight of -0.34047909262221765
RainToday has weight of 0.12969582704363347

accuracy: 0.8559163130366131
[[15305   700]
 [ 2220  2041]]
Time to Run: 1134.5163350105286
MinTemp has weight of -0.032272167881501446
MaxTemp has weight of 0.2914321176681369
Rainfall has weight of 0.10085796791781831
WindGustDir has weight of 0.021398699279984612
WindGustSpeed has weight of 0.4682637167145458
WindDir9am has weight of -0.0775149272010367
WindDir3pm has weight of 0.07139927785379996
WindSpeed9am has weight of -0.02007749821973448
WindSpeed3pm has weight of -0.2121415831802551
Humidity9am has weight of -0.012654669047151401
Humidity3pm has weight of 0.8357375333598611
Pressure9am has weight of 0.7408511988928694
Pressure3pm has weight of -0.9930128601945398
Cloud9am has weight of 0.07324404859036804
Cloud3pm has weight of 0.1744157590337636
Temp9am has weight of 0.012824487521470473
Temp3pm has weight of -0.30096142026286543
RainToday has weight of 0.129942084706272

accuracy: 0.8517220961215829
[[15271   660]
 [ 2345  1990]]
Time to Run: 1415.7164871692657
MinTemp has weight of -0.02631767323055101
MaxTemp has weight of 0.293152164045523
Rainfall has weight of 0.10478016613774344
WindGustDir has weight of 0.009199744126476617
WindGustSpeed has weight of 0.4654547317752531
WindDir9am has weight of -0.07412868716073717
WindDir3pm has weight of 0.07835132266120581
WindSpeed9am has weight of -0.027131236365448785
WindSpeed3pm has weight of -0.20682381647770853
Humidity9am has weight of -0.00821442093274527
Humidity3pm has weight of 0.8213471844364904
Pressure9am has weight of 0.7672990285500418
Pressure3pm has weight of -1.0147235803203785
Cloud9am has weight of 0.06879586752984324
Cloud3pm has weight of 0.16868445226828044
Temp9am has weight of 0.028933535975028235
Temp3pm has weight of -0.3367889668568296
RainToday has weight of 0.13230781864285746
In [13]:
#look at the support vectors
print(svm_clf.support_vectors_.shape)
print(svm_clf.support_.shape)
print(svm_clf.n_support_ )
(28292, 18)
(28292,)
[14149 14143]
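These three attributes are consistent with one another: `support_` holds the training indices of the support vectors, `support_vectors_` holds their feature rows, and `n_support_` counts them per class. This can be verified on a toy fit (a sketch with synthetic data, not the rainfall model):

```python
import numpy as np
from sklearn.svm import SVC

#Two well-separated Gaussian blobs as a stand-in binary problem.
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.randn(20, 2) + 2, rng.randn(20, 2) - 2])
y_toy = np.array([1] * 20 + [0] * 20)

clf = SVC(C=0.5, kernel='linear', gamma='auto').fit(X_toy, y_toy)
n_sv = clf.support_vectors_.shape[0]
```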
In [14]:
#Run at the start of plotly.
ply.offline.init_notebook_mode() 

#Get mean weights for each variable
mean_weights = np.mean(weights_array,axis = 0)

#Get standard deviation of weights for each variable
std_weights = np.std(weights_array,axis = 0)

#Make one dataframe with variable names as the index, and a column of means and a column of standard deviations for each variable.
final_array = pd.DataFrame(data={'mean':mean_weights, 'std':std_weights}, index = column_names)

#Sort the variables in ascending order using the 'mean' column.
final_array = final_array.sort_values(by=['mean'])

#Error bar.
error_y=dict(type='data', array=final_array['std'].values, visible=True)

#Graph mean with standard deviation of each variable.
graph1 = {'x': final_array.index, 'y': final_array['mean'].values, 'error_y':error_y,'type': 'bar'}

fig_svm = dict()
fig_svm['data'] = [graph1]
fig_svm['layout'] = {'title': 'Support Vector Machines Weights, with error bars'}

ply.offline.iplot(fig_svm)

#Grab coefficients with absolute values greater than 0.5; the cutoff can be set by the user.
cutoff = 0.5
svm_voi = []
for index, columns in final_array.iterrows():
    if (columns['mean'] > cutoff) or (columns['mean'] < -cutoff):
        svm_voi.append(index)
In [15]:
#Make a dataframe of the training data from the last iteration above.
df_tested_on = svm_rainfall.iloc[train_indices]

#Get the support vectors from the trained model.
df_support = df_tested_on.iloc[svm_clf.support_,:]

#Add the response variable back to the explanatory datasets.
df_support['RainTomorrow'] = y[svm_clf.support_]
svm_rainfall['RainTomorrow'] = y

#Group observations by the response variable, 1's together and 0's together.
df_grouped_support = df_support.groupby(['RainTomorrow'])
df_grouped = svm_rainfall.groupby(['RainTomorrow'])

#Plot kernel density estimates of variables with weights larger than the user-defined cutoff set above, 0.5 in this case.
vars_to_plot = svm_voi

for v in vars_to_plot:
    plt.figure(figsize=(10,4))
    #Plot support vector stats.
    plt.subplot(1,2,1)
    ax = df_grouped_support[v].plot.kde() 
    plt.legend(['no rain','rained'])
    plt.title(v+' (Instances chosen as Support Vectors)')
    
    #Plot original distributions
    plt.subplot(1,2,2)
    ax = df_grouped[v].plot.kde() 
    plt.legend(['no rain','rained'])
    plt.title(v+' (Original)')
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Model Advantages

Both Logistic Regression (LR) and Support Vector Machines (SVM) are algorithms used for classification problems, and LR and SVM with a linear kernel generally perform comparably in practice. Two differences are worth noting, however:

  • Logistic loss diverges faster than hinge loss, so in general it is more sensitive to outliers.
  • Logistic loss does not go to zero even when a point is classified sufficiently confidently, which can lead to minor degradation in accuracy.
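The second point can be illustrated numerically. With margin m = y·f(x), hinge loss is max(0, 1 − m) and logistic loss is log(1 + e^(−m)); hinge is exactly zero once m ≥ 1, while logistic loss stays positive for every finite margin. A minimal sketch:

```python
import numpy as np

def hinge(m):
    #Hinge loss: zero for points beyond the margin (m >= 1).
    return np.maximum(0.0, 1.0 - m)

def logistic(m):
    #Logistic loss: strictly positive for any finite margin.
    return np.log1p(np.exp(-m))

margins = np.array([-2.0, 0.0, 1.0, 3.0])
h = hinge(margins)
l = logistic(margins)
```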

For the purposes of the "Predict Rain Tomorrow in Australia" dataset, the accuracy for both LR and SVM hovered around 85%; neither showed superior performance over the other. This is likely because our pre-processing removed outliers from the data set. We ran each algorithm 5 times to see if there would be any significant deviation; both cases produced essentially the same accuracy. What was noticeable was the amount of time required to run LR versus SVM. We included a timer to measure each run (cumulative from the start of the loop); the results were:

  • Logistic Regression (seconds)

    • 1st: 0.94
    • 2nd: 1.85
    • 3rd: 2.89
    • 4th: 3.75
    • 5th: 4.68
  • Support Vector Machine (seconds)

    • 1st: 405
    • 2nd: 856
    • 3rd: 1259
    • 4th: 1706
    • 5th: 2174

This was a large dataset, so SVM took roughly 464 times longer than LR to complete all 5 runs. Since both performed nearly identically in accuracy, we conclude that LR is more appropriate for this dataset in terms of the time and resources required.

Interpret Feature Importance

Weights depict the association between each predictor and what we are trying to predict: whether it will rain in Australia tomorrow. In LR and SVM it is tempting to conclude that variables with larger weights/coefficients are more important because they point to more significant changes in the odds of rain. Before reaching that conclusion, however, it is important to check whether all the predictors are in the same units, which was not the case for this data set. Comparing raw coefficients across different units would be comparing apples to oranges, so prior to moving forward it was important to standardize all the predictors. Here are the graphical results for LR and SVM:

In [16]:
ply.offline.iplot(fig)
ply.offline.iplot(fig_svm)

Judging from the two charts, LR and SVM had identical predictors for their top 4:

  • Top 4
    • Humidity3pm
    • Pressure9am
    • WindGustSpeed
    • Pressure3pm

Features with weights close to 0 are not strong predictors of rain, while those with larger absolute values are considered strong predictors. This makes real-world sense: when reporting on weather systems leading up to rain in the forecast, meteorologists often point to these same predictors. It is nice to see that both the LR and SVM models produced similar results. The remaining predictors in between had different weights and were therefore ranked in a different order.

In general, it seems intuitive to conclude that variables with larger coefficients (weights) are more important because they point to more significant changes in rain. However, because the features are all in different units, one needs to standardize them before comparing. In both the LR and SVM cases the features were standardized, so each weight represents the change in the response for a one-standard-deviation change in the predictor. The top 4 features listed above have the highest absolute weights, making those variables more important than the rest.
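The one-standard-deviation interpretation can be checked directly: if a predictor is z-scored, adding one raw standard deviation to it raises the model's linear score (the log-odds, for LR) by exactly that predictor's weight. A minimal numeric sketch, using a hypothetical weight value:

```python
import numpy as np

x = np.array([10.0, 12.0, 14.0, 20.0])   #raw values of one predictor
mu, sd = x.mean(), x.std()
w = 1.15                                  #hypothetical standardized weight

def score(raw_value):
    #Contribution of this (z-scored) predictor to the linear score.
    return w * (raw_value - mu) / sd

#Effect of a one-standard-deviation increase in the raw predictor.
delta = score(x[0] + sd) - score(x[0])
```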

Interpret Support Vectors

When completing the Support Vector Machine analysis, we did not utilize stochastic gradient descent (SGD). Although the time required to complete the SVM was longer than for the logistic regression, it was not so long that it became unmanageable. For a dataset larger than this one, SGD would have been applied. We did, however, subset our data in the process in order to create random training and testing samples.
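Had SGD been needed, the linear SVM could have been trained by stochastic updates on the hinge loss instead of the exact solver. Below is a minimal Pegasos-style sketch on synthetic data (an illustration, not the pipeline used above):

```python
import numpy as np

def sgd_linear_svm(X, y, lam=0.01, epochs=100, seed=0):
    #Pegasos-style SGD for a linear SVM: labels must be -1/+1;
    #a constant column is appended to X to learn a bias term.
    rng = np.random.RandomState(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(Xb.shape[1])
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(Xb)):
            t += 1
            eta = 1.0 / (lam * t)            #decaying step size
            if y[i] * Xb[i].dot(w) < 1:      #point violates the margin
                w = (1 - eta * lam) * w + eta * y[i] * Xb[i]
            else:
                w = (1 - eta * lam) * w      #regularization shrinkage only
    return w

#Small linearly separable toy problem.
X_toy = np.array([[2.0, 2.0], [3.0, 2.0], [2.0, 3.0],
                  [-2.0, -2.0], [-3.0, -2.0], [-2.0, -3.0]])
y_toy = np.array([1, 1, 1, -1, -1, -1])

w = sgd_linear_svm(X_toy, y_toy)
preds = np.sign(np.hstack([X_toy, np.ones((6, 1))]).dot(w))
```

Each update touches a single sample, so cost per step is independent of the dataset size; this is what makes SGD attractive when the exact SVM solver becomes too slow.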

The chosen support vectors for some of the variables (Pressure9am, Pressure3pm, and Humidity3pm) are shown visually below. Pressure9am and Pressure3pm are very similar to each other in their prediction. However, looking at the data, we can see that high pressure in the morning followed by low pressure in the afternoon significantly contributes to a prediction of rainfall the following day. High humidity at 3pm also contributes to the likelihood of rain tomorrow.

In [17]:
#Make a dataframe of the training data from the last iteration above.
df_tested_on = svm_rainfall.iloc[train_indices]

#Get the support vectors from the trained model.
df_support = df_tested_on.iloc[svm_clf.support_,:]

#Add the response variable back to the explanatory datasets.
df_support['RainTomorrow'] = y[svm_clf.support_]
svm_rainfall['RainTomorrow'] = y

#Group observations by the response variable, 1's together and 0's together.
df_grouped_support = df_support.groupby(['RainTomorrow'])
df_grouped = svm_rainfall.groupby(['RainTomorrow'])

#Plot kernel density estimates of variables with weights larger than the user-defined cutoff set above, 0.5 in this case.
vars_to_plot = svm_voi

for v in vars_to_plot:
    plt.figure(figsize=(10,4))
    #Plot support vector stats.
    plt.subplot(1,2,1)
    ax = df_grouped_support[v].plot.kde() 
    plt.legend(['no rain','rained'])
    plt.title(v+' (Instances chosen as Support Vectors)')
    
    #Plot original distributions
    plt.subplot(1,2,2)
    ax = df_grouped[v].plot.kde() 
    plt.legend(['no rain','rained'])
    plt.title(v+' (Original)')
/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:8: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [ ]: